The N-Grams Based Text Similarity Detection Approach Using Self-Organizing Maps and Similarity Measures
Similar resources
Text Similarity Using Google Tri-grams
The purpose of this paper is to propose an unsupervised approach for measuring the similarity of texts that can compete with supervised approaches. Finding the inherent properties of similarity between texts using a corpus in the form of a word n-gram data set is competitive with other text similarity techniques in terms of performance and practicality. Experimental results on a standard data s...
Dependency vs. Constituent Based Syntactic N-Grams in Text Similarity Measures for Paraphrase Recognition
Paraphrase recognition consists of detecting whether an expression restated as another expression contains the same information. Traditionally, several lexical, syntactic, and semantic techniques are used to solve this problem. For measuring word overlap, most works use n-grams; however, syntactic n-grams have scarcely been explored. We propose using syntactic dependency and...
Text Reuse Detection using a Composition of Text Similarity Measures
Detecting text reuse is a fundamental requirement for a variety of tasks and applications, ranging from journalistic text reuse to plagiarism detection. Text reuse is traditionally detected by computing similarity between a source text and a possibly reused text. However, existing text similarity measures exhibit a major limitation: They compute similarity only on features which can be derived ...
Binary-based similarity measures for categorical data and their application in Self-Organizing Maps
In exploratory data analysis of high-dimensional data, one of the main tasks is the formation of a simplified overview of data sets. Clustering and projection are among the examples of useful methods to achieve this task. However, there are several types of data where the use of this measure is not adequate, such as categorical data. In this paper we will review some of the most common binar...
Gauging Similarity with n-Grams: Language-Independent Categorization of Text.
A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is requir...
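The combination described in this abstract, character n-grams plus a simple vector-space comparison, can be illustrated with a minimal sketch. The details below are assumptions made for illustration only: raw n-gram counts as vector components, cosine similarity as the comparison, and the helper names `char_ngrams` and `cosine_similarity`. The truncated abstract does not state which weighting or measure the authors actually use.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (hypothetical helper).

    Assumption: lowercase the text and collapse whitespace first;
    the cited paper may normalize differently.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text shorter than n characters has no n-grams.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = char_ngrams("text similarity using n-grams")
    doc_b = char_ngrams("measuring similarity of texts with n-grams")
    print(cosine_similarity(doc_a, doc_b))
```

Because the vectors are built from characters rather than words, no tokenizer or language-specific resource is needed, which is consistent with the language-independent framing of the abstract.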
Journal
Journal title: Applied Sciences
Year: 2019
ISSN: 2076-3417
DOI: 10.3390/app9091870